'Humanity's Last Exam' benchmark is stumping top AI models – can you do any better?
Are artificial intelligence (AI) models really surpassing human ability? Or are current tests just too easy for them?
On Thursday, Scale AI and the Center for AI Safety (CAIS) released Humanity’s Last Exam (HLE), a new academic benchmark aiming to “test the limits of AI knowledge at the frontiers of human expertise,” Scale AI said in a release. The test consists of 3,000 text and multi-modal questions on more than 100 subjects like math, science, and humanities, submitted by experts in a variety of fields.
Also: Roll over, Darwin: How Google DeepMind’s ‘mind evolution’ could enhance AI thinking
Anthropic’s Michael Gerstenhaber, head of API technologies, noted to Bloomberg last fall that AI models frequently outpace benchmarks (part of why the Chatbot Arena leaderboard changes so rapidly when new models are released). For example, many LLMs now score over 90% on Massive Multitask Language Understanding (MMLU), a commonly used benchmark. This is known as benchmark saturation.
By contrast, Scale reported that current models answered fewer than 10% of the HLE benchmark’s questions correctly.
Researchers from the two organizations collected over 70,000 questions for HLE initially, narrowing them to 13,000 that were reviewed by human experts and then distilled once more into the final 3,000. They tested the questions on top models like OpenAI’s o1 and GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro alongside the MMLU, MATH, and GPQA benchmarks.
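For a rough sense of how a closed-ended benchmark like this is typically scored, here is a minimal sketch of an exact-match evaluation loop. The question schema, the `model_fn` stand-in, and the toy examples are all hypothetical; HLE’s actual grading harness, prompts, and multimodal handling may differ.

```python
# Minimal sketch of exact-match scoring for a closed-ended benchmark.
# The item schema and model_fn interface are illustrative assumptions,
# not HLE's real evaluation harness.
from typing import Callable

def exact_match(prediction: str, answer: str) -> bool:
    """Case- and whitespace-insensitive comparison of short answers."""
    return prediction.strip().lower() == answer.strip().lower()

def evaluate(questions: list[dict], model_fn: Callable[[str], str]) -> float:
    """Accuracy of model_fn over items shaped like {'question': ..., 'answer': ...}."""
    correct = sum(
        exact_match(model_fn(item["question"]), item["answer"])
        for item in questions
    )
    return correct / len(questions)

# Toy usage with a stand-in "model" that always answers "4".
sample = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]
print(f"accuracy: {evaluate(sample, lambda q: '4'):.0%}")  # prints 50%
```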
“When I released the MATH benchmark — a challenging competition mathematics dataset — in 2021, the best model scored less than 10%; few predicted that scores higher than 90% would be achieved just three years later,” said Dan Hendrycks, CAIS co-founder and executive director. “Right now, Humanity’s Last Exam shows that there are still some expert closed-ended questions that models are not able to answer. We will see how long that lasts.”
Also: DeepSeek’s new open-source AI model can outperform o1 for a fraction of the cost
Scale and CAIS gave contributors cash prizes for the top questions: $5,000 went to each of the top 50, while the next best 500 received $500. Although the final questions are now public, the two organizations kept another set of questions private to guard against “model overfitting,” which occurs when a model is trained so closely to a dataset that it cannot make accurate predictions on new data.
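To illustrate why a withheld question set helps catch overfitting, the sketch below compares a model’s accuracy on the published questions with its accuracy on privately held ones. The scores and the gap threshold are invented for the example, not figures reported by Scale or CAIS.

```python
# Illustrative overfitting check: a model tuned (directly or indirectly)
# on published benchmark questions should score noticeably better on them
# than on a privately held set. All numbers here are made up.

def overfitting_gap(public_accuracy: float, private_accuracy: float) -> float:
    """How much better the model scores on questions it may have been tuned on."""
    return public_accuracy - private_accuracy

public_acc = 0.09   # hypothetical score on the released questions
private_acc = 0.05  # hypothetical score on the withheld questions

gap = overfitting_gap(public_acc, private_acc)
if gap > 0.03:  # arbitrary threshold for this example
    print(f"Gap of {gap:.0%}: public-set results may reflect overfitting.")
else:
    print(f"Gap of {gap:.0%}: public and private scores are roughly consistent.")
```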
The benchmark’s creators note that they are still accepting test questions, but will no longer award cash prizes, though contributors are eligible for co-authorship.
CAIS and Scale AI plan to release the dataset to researchers so that they can further study new AI systems and their limitations. You can view all benchmark and sample questions at lastexam.ai.